gh-129987: Disable GCC SLP autovectorization for the interpreter loop on x86-64 by mpage · Pull Request #132295 · python/cpython

mpage · 2025-04-09T00:02:23Z

#131750 mysteriously caused a ~6% regression for the free-threaded build. The cause was poor code generation for opcode dispatch in the interpreter loop. Before that change, the dispatch code looked like:

/root/src/cpython/Python/generated_cases.c.h:8808 [LOAD_FAST_BORROW]
            DISPATCH();

          19cd0a: mov    -0x268(%rbp),%rsi
          19cd11: movzbl %ah,%ecx
          19cd14: movzbl %al,%eax
          19cd17: mov    %ecx,%r10d
          19cd1a: jmp    *(%rsi,%rax,8)

After the change, the dispatch code looked like:

# Shared dispatch code
/root/src/cpython/Python/generated_cases.c.h:81 [BINARY_OP]
            DISPATCH();

          19dd67: mov    -0x280(%rbp),%r10
          19dd6e: movzbl %ah,%ecx
          19dd71: movzbl %al,%eax
          19dd74: mov    %ecx,%r14d
          19dd77: mov    -0x270(%rbp),%rcx
          19dd7e: mov    (%rcx,%rax,8),%rdx
          19dd82: nopw   0x0(%rax,%rax,1)
          19dd88: movq   -0x258(%rbp),%xmm0
          19dd90: movq   %r12,%xmm4
          19dd95: punpcklqdq %xmm4,%xmm0
          19dd99: movhlps %xmm0,%xmm3
          19dd9c: movq   %xmm0,%r15
          19dda1: movq   %xmm3,%r11
          19dda6: mov    %r11,%rcx
          19dda9: jmp    *%rdx
          
# Duplicated dispatch code
/root/src/cpython/Python/generated_cases.c.h:8808 [LOAD_FAST_BORROW]
            DISPATCH();

          19dde4: movzbl %ah,%ecx
          19dde7: movzbl %al,%eax
          19ddea: mov    %ecx,%r14d
          19dded: mov    -0x270(%rbp),%rcx
          19ddf4: mov    (%rcx,%rax,8),%rdx
          19ddf8: jmp    19dd99 <_PyEval_EvalFrameDefault+0x289>

There are two problems:

1. We now have two jumps (one direct jump to the shared dispatch logic and one indirect jump to the next opcode handler) instead of one (the indirect jump to the opcode handler).
2. There's a significant amount of register shuffling in the shared dispatch code.
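For readers unfamiliar with the dispatch pattern under discussion, here is a minimal sketch of a computed-goto interpreter in which every handler ends with its own copy of the indirect jump. The opcodes, the `code_unit` layout, and `run_toy` are hypothetical and far simpler than CPython's `DISPATCH()`; the sketch relies on the GCC/Clang "labels as values" extension. Note that the compiler remains free to merge the per-handler jumps into one shared block, which is the behavior described above.

```c
#include <stdint.h>

/* Hypothetical toy opcodes; not CPython's instruction set. */
enum { OP_PUSH, OP_ADD, OP_HALT };

/* Each code unit is an (opcode, oparg) byte pair, loosely mirroring the
   movzbl %al / movzbl %ah split seen in the listings. */
typedef struct { uint8_t op; uint8_t arg; } code_unit;

/* Runs a toy program and returns the value on top of the stack at OP_HALT.
   Requires GCC or Clang (labels as values). */
long run_toy(const code_unit *next_instr)
{
    static void *const targets[] = { &&op_push, &&op_add, &&op_halt };
    long stack[16];
    long *stack_pointer = stack;
    uint8_t oparg;

/* Decode the next code unit and jump straight to its handler.
   Expanded once per handler, so the source has one indirect jump each. */
#define DISPATCH()                         \
    do {                                   \
        oparg = next_instr->arg;           \
        goto *targets[(next_instr++)->op]; \
    } while (0)

    DISPATCH();

op_push:
    *stack_pointer++ = oparg;
    DISPATCH();
op_add:
    stack_pointer--;
    stack_pointer[-1] += stack_pointer[0];
    DISPATCH();
op_halt:
    return stack_pointer[-1];
#undef DISPATCH
}
```

Whether the compiled code keeps one indirect jump per handler depends on the compiler and its optimization passes, so inspecting the disassembly is the only reliable check.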

Both of these problems appear to be caused by GCC's SLP autovectorizer. After the change, it decides to store both the next_instr pointer and the stack_pointer in a single 128-bit register in the shared basic block that contains the opcode dispatch. This is introduced in the slp1 pass (tree dump below):

  _24061 = VIEW_CONVERT_EXPR<long unsigned int>(stack_pointer_14587);
  _24062 = VIEW_CONVERT_EXPR<long unsigned int>(next_instr_14097);
  _24063 = {_24062, _24061};

  <bb 19> [count: 1658034300]:
  # frame_2363(ab) = PHI <frame_20485(4258), frame_20519(18)>
  # oparg_1245(ab) = PHI <oparg_20252(4258), oparg_14635(18)>
  # next_instr_1246(ab) = PHI <next_instr_11924(4258), next_instr_14097(18)>
  # stack_pointer_2976(ab) = PHI <stack_pointer_20484(4258), stack_pointer_14587(18)>
  # _3209 = PHI <_20217(4258), _20681(18)>

  # 
  # Combination of next_instr and stack_pointer:
  # 

  # vect_next_instr_1246.7061_24064 = PHI <vect_next_instr_11924.7060_24060(4258), _24063(18)>
  _24067 = BIT_FIELD_REF <vect_next_instr_1246.7061_24064, 64, 64>;
  _24068(ab) = (union _PyStackRef *) _24067;
  _24065 = BIT_FIELD_REF <vect_next_instr_1246.7061_24064, 64, 0>;
  _24066(ab) = (union _Py_CODEUNIT *) _24065;

  # DEBUG stack_pointer => stack_pointer_2976(ab)
  # DEBUG next_instr => next_instr_1246(ab)
  # DEBUG oparg => oparg_1245(ab)
  # DEBUG frame => frame_2363(ab)
  goto _3209;
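To make the dump easier to read, the following sketch expresses the same pack and unpack steps by hand using the GCC/Clang vector extension. It is an illustration only, assuming 64-bit pointers; `v2u64`, `ptr_pair`, `pack_pair`, and `unpack_pair` are hypothetical names, and the real transformation is performed by the compiler on its internal representation, not written in source.

```c
#include <stdint.h>

/* Two 64-bit lanes in one 128-bit vector (GCC/Clang vector extension). */
typedef uint64_t v2u64 __attribute__((vector_size(16)));

/* Hypothetical stand-in for the two interpreter pointers. */
typedef struct { void *next_instr; void *stack_pointer; } ptr_pair;

/* Mirrors `_24063 = {_24062, _24061}`:
   lane 0 holds next_instr, lane 1 holds stack_pointer. */
v2u64 pack_pair(void *next_instr, void *stack_pointer)
{
    v2u64 v = { (uint64_t)(uintptr_t)next_instr,
                (uint64_t)(uintptr_t)stack_pointer };
    return v;
}

/* Mirrors the two BIT_FIELD_REFs: bits 0..63 give next_instr back,
   bits 64..127 give stack_pointer back. */
ptr_pair unpack_pair(v2u64 v)
{
    ptr_pair p;
    p.next_instr = (void *)(uintptr_t)v[0];
    p.stack_pointer = (void *)(uintptr_t)v[1];
    return p;
}
```

Packing and immediately unpacking does no useful work, which is consistent with the extra `movq`, `punpcklqdq`, and `movhlps` instructions showing up in the shared dispatch block.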

Disabling the SLP autovectorization pass for the interpreter loop fixes both problems. After this change the opcode dispatch code looks like:

/root/src/cpython/Python/generated_cases.c.h:8808 [LOAD_FAST_BORROW]
            DISPATCH();

          19aa37: mov    -0x260(%rbp),%rsi
          19aa3e: movzbl %ah,%ecx
          19aa41: movzbl %al,%eax
          19aa44: movslq %ecx,%r15
          19aa47: jmp    *(%rsi,%rax,8)
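One way to turn off SLP vectorization for a single function under GCC is the `optimize` function attribute with `no-tree-slp-vectorize`, the per-function counterpart of the `-fno-tree-slp-vectorize` command-line flag. The sketch below shows the general shape under that assumption; the macro name `NO_SLP_VECTORIZE` and the function `sum_pairs` are hypothetical and are not taken from this PR's diff.

```c
/* Hypothetical macro name chosen for this sketch.
   Only applied for GCC on x86-64; expands to nothing elsewhere. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__)
#  define NO_SLP_VECTORIZE __attribute__((optimize("no-tree-slp-vectorize")))
#else
#  define NO_SLP_VECTORIZE
#endif

/* Hypothetical stand-in for the function containing the interpreter loop.
   The attribute changes code generation only, not the result. */
NO_SLP_VECTORIZE
long sum_pairs(const long *a, const long *b, int n)
{
    long total = 0;
    for (int i = 0; i < n; i++) {
        total += a[i] + b[i];
    }
    return total;
}
```

Scoping the attribute to one function leaves vectorization enabled for the rest of the translation unit, which a file-wide `-fno-tree-slp-vectorize` would not.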

Performance improves by ~8% for the free-threaded build.

Surprisingly, this also seems to improve performance for the default build by ~4%. I don't understand why and I don't fully trust the result. The generated dispatch code for the default build looks unaffected by this change. Additionally, measuring instructions retired using fastbench shows a negligible change, whereas it shows a ~8% reduction for the free-threaded build.

Issue: computed-goto interpreter: Prevent the compiler from merging DISPATCH calls #129987

mpage merged 1 commit into python:main from mpage:gh-129987-no-slp-vectorize
